Streaming k-means on Well-Clusterable Data

Authors

  • Vladimir Braverman
  • Adam Meyerson
  • Rafail Ostrovsky
  • Alan Roytman
  • Michael Shindler
  • Brian Tagiku
Abstract

One of the central problems in data analysis is k-means clustering. In recent years, considerable attention in the literature addressed the streaming variant of this problem, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that produced a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is Max-SNP hard, all algorithms that achieve a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid exponential dependence on k, some additional assumptions must be made to guarantee high-quality approximation and polynomial running time. A recent paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability: the assumption closely reflects how k-means is used in practice and allowed the authors to create a high-quality approximation for k-means clustering in the non-streaming setting with polynomial running time even for large values of k. Their work left open a natural and important question: are similar results possible in a streaming setting? This is the question we answer in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must be fully polynomial and P ≠ NP). The novel techniques we develop along the way imply a number of additional results: we provide a high-probability performance guarantee for online facility location (in contrast, Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant-approximation method for the general class of semi-metric clustering problems; we improve (even without σ-separability) the space requirements for streaming constant-approximation for k-median by a logarithmic factor; finally, we design a "re-sampling method" in a streaming setting to convert any constant approximation for clustering to a [1 + O(σ)]-approximation for σ-separable data.

∗ Computer Science Department, UCLA, [email protected]. Supported in part by NSF grants 0830803, 0916574.
† Computer Science Department, UCLA, [email protected]. Research partially supported by NSF CIF Grant CCF-1016540.
‡ Computer Science and Mathematics Departments, UCLA, [email protected]. Research partially supported by IBM Faculty Award, Xerox Innovation Award, the Okawa Foundation Award, Intel, Teradata, NSF grants 0830803, 0916574, BSF grant 2008411 and U.C. MICRO grant.
§ Computer Science Department, UCLA, [email protected]. Research partially supported by NSF CIF Grant CCF-1016540.
¶ Computer Science Department, UCLA, [email protected].
‖ Computer Science Department, UCLA, [email protected].
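The σ-separability assumption of Ostrovsky, Rabani, Schulman, and Swamy can be made concrete with a small sketch: data is σ-separable when the optimal k-means cost is at most σ² times the optimal (k−1)-means cost, i.e. adding the k-th center reduces the cost dramatically. The sketch below is illustrative only; it computes a brute-force optimum restricted to centers drawn from the data points (a discrete relaxation, feasible only for toy inputs), and the function names are our own, not the authors' implementation.

```python
from itertools import combinations

def kmeans_cost(points, centers):
    """k-means objective: sum of squared Euclidean distances
    from each point to its nearest center."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers)
        for p in points
    )

def opt_cost(points, k):
    """Brute-force optimum restricted to centers chosen among the data
    points themselves (a discrete relaxation; toy inputs only)."""
    return min(kmeans_cost(points, ctrs) for ctrs in combinations(points, k))

def is_separable(points, k, sigma):
    """sigma-separability: the optimal k-means cost is at most
    sigma^2 times the optimal (k-1)-means cost."""
    return opt_cost(points, k) <= sigma ** 2 * opt_cost(points, k - 1)
```

For two tight, well-separated 1-D clusters such as `[(0.0,), (0.1,), (10.0,), (10.1,)]`, the optimal 2-means cost is tiny compared to the optimal 1-means cost, so the data is σ-separable even for small σ; evenly spaced points are not.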


Similar articles

An Aposteriorical Clusterability Criterion for k-Means and Simplicity of Clustering

We define the notion of a well-clusterable data set from the point of view of the k-means objective and common sense. The novelty introduced here is that one can check a posteriori (after running k-means) whether the data set is well-clusterable.


Clustering Oligarchies

We investigate the extent to which clustering algorithms are robust to the addition of a small, potentially adversarial, set of points. Our analysis reveals radical differences in the robustness of popular clustering methods. k-means and several related techniques are robust when data is clusterable, and we provide a quantitative analysis capturing the precise relationship between clusterabilit...


Scalable constant k-means approximation via heuristics on well-clusterable data

We present a simple heuristic clustering procedure, with running time independent of the data size, that combines random sampling with Single-Linkage (Kruskal’s algorithm), and show that with sufficient probability, it has a constant approximation guarantee with respect to the optimal k-means cost, provided an optimal solution satisfies a center-separability assumption. As the separation increa...
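The combination of random sampling with single-linkage clustering described above can be sketched as follows. This is our own illustrative rendering, not the paper's implementation: `single_linkage` runs Kruskal's algorithm with union-find, merging the closest pair of clusters until k components remain (squared distances are used, which preserves the merge order), and `sample_and_cluster` clusters only a fixed-size random sample, so its running time is independent of the full data size.

```python
import random

def single_linkage(points, k):
    """Single-linkage via Kruskal: repeatedly merge the two closest
    clusters until exactly k clusters remain. Returns a cluster label
    (union-find root index) per point."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, sorted by squared Euclidean distance.
    edges = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), i, j)
        for i in range(len(points)) for j in range(i + 1, len(points))
    )
    clusters = len(points)
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
            if clusters == k:
                break
    return [find(i) for i in range(len(points))]

def sample_and_cluster(data, k, sample_size, seed=0):
    """Cluster a random sample of fixed size, so the work done is
    independent of len(data)."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    return sample, single_linkage(sample, k)
```

On a sample from well-clusterable data the large inter-cluster gaps mean single linkage recovers the optimal partition of the sample with good probability, which is what drives the constant-factor guarantee claimed above.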


Which Data Sets are ‘Clusterable’? – A Theoretical Study of Clusterability

We investigate measures of the clusterability of data sets. Namely, ways to define how ‘strong’ or ‘conclusive’ is the clustering structure of a given data set. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in th...


Testing of Clustering

A set X of points in ℝ^d is (k, b)-clusterable if X can be partitioned into k subsets (clusters) so that the diameter (alternatively, the radius) of each cluster is at most b. We present algorithms that, by sampling from a set X, distinguish between the case that X is (k, b)-clusterable and the case that X is ε-far from being (k, b′)-clusterable for any given 0 < ε ≤ 1 and for b′ ≥ b. In ε-far from bei...




Publication date: 2011